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Abstract 

Nonnegative Matrix Factorization consists in (approximately) factorizing a nonnegative data 
matrix by the product of two low-rank nonnegative matrices. It has been successfully applied as a 
data analysis technique in numerous domains, e.g., text mining, image processing, microarray data 
analysis, collaborative filtering, etc. 

We introduce a novel approach to solve NMF problems, based on the use of an underapproxi- 
mation technique, and show its effectiveness to obtain sparse solutions. This approach, based on 
Lagrangian relaxation, allows the resolution of NMF problems in a recursive fashion. We also prove 
that the underapproximation problem is NP-hard for any fixed factorization rank, using a reduction 
of the maximum edge biclique problem in bipartite graphs. 

We test two variants of our underapproximation approach on several standard image datasets 
and show that they provide sparse part-based representations with low reconstruction error. Our 
results are comparable and sometimes superior to those obtained by two standard Sparse Nonneg- 
ative Matrix Factorization techniques. 
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1 Introduction 



Nonnegative Matrix Factorization (NMF) is a recent data analysis technique with applications in image 
processing, text mining, spectral unmixing, air emission control, computational biology, clustering, etc. 
(see [U [21 [3l 3] and references therein). NMF can be described as follows: given a nonnegative input 
matrix M S M™ xra and an integer 1 < r < min(m, ra), find two nonnegative matrices V £ M+ Xr and 
W £ W+ n whose product approximates the input matrix as closely as possible: 



M as VW, 



so that VW is a low-rank approximation of M. Matrix factorization can be interpreted as a linear 
factor model: assuming that each column of the input matrix M represents an element of a data set, 
decomposition ([I]) can be written as^] 

k 

i.e., each input column M : j is a linear combination of a set of r basis elements V± with corresponding 
weights Wkj- 

In contrast with standard linear factor model techniques such as Principal Component Analysis, 
NMF considers nonnegativity of the input columns to be an important feature and consequently 
requires the basis elements to be also nonnegative, so that they can be interpreted in the same way 
(e.g., these columns can correspond to images described by nonnegative pixel intensities or to texts 
represented by vectors of nonnegative word counts). Furthermore, NMF imposes nonnegativity of the 
weights, leading to an essentially additive reconstruction of the input columns by the basis elements. 
This representation is then part-based: basis elements V± will represent common parts of the columns 
of M-j. For example, if each column of M represents a face using pixel intensities, the basis elements 
generated by NMF can be facial features, such as eyes, noses and lips, as shown in Figure [TJ 
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Figure 1: NMF applied to the CBCL Face Database #1, MIT Center For Biological and Computation 
Learning (available at http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html). It consists of 
the approximation of 2429 gray-level images of faces represented with 19 x 19 pixels (columns of M) 
using r = 49 basis elements (columns of V). 



This low-rank approximation technique with nonnegativity constraints was introduced in 1994 by 
Paatero and Tapper [5 J and started to be extensively studied after the publication of an article by 
Lee and Seung [6] in 1999. Since an exact representation of the input matrix cannot be obtained 

1 In this text, Aij stands for the (i, j)-entry of a matrix A, A-j for its j th column and Aj : for its i th row. 
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in general, the quality of the approximation is measured by some criterion, typically the sum of the 
squares of the errors on the entries, which leads to the following minimization problem^] 

min ||M-VVF||! such that V > and W> 0. (NMF) 

VeR mxr ,WeR rxn 

An important feature of NMF is that its nonnegativity constraints typically induce sparse factors, 
i.e., factors with relatively many zero entries. Intuitively, decomposition into parts requires the basis 
elements to be sparse, cf. Figure [T] More formally, the reason for this behavior is that stationary 
points (V, W) of NMF will be typically located at the boundary of the feasible domain R™ xr x M+ Xri , 
hence will feature zero components. This can be explained with the first-order optimality conditions: 
because the set of stationary points of a problem of the type 

min f(x) such that x > 

is given by the following expression (where V/(x) is the gradient of /) 

S f = {x G R n | x > 0, V/(:c) > and Xi [Vf(x)]i = OVi}, 

some components of the solution can be expected to be equal to zero. 

Sparsity of the factors is an important consideration in practice: in addition to reducing memory 
requirements to store the basis elements and their weights, sparsity improves interpretation of the 
factors, especially when dealing with classification/clustering problems, e.g., in text mining [7 u and 
computational biology [8] [9]. By contrast, unconstrained low-rank approximations such as Princi- 
pal Component Analysis (PCA) do not naturally generate sparse factors (for that reason, low-rank 
approximations techniques with additional sparsity constraints have been recently introduced; this 
is referred to as Sparse Principal Component Analysis, Sparse PCA or SPCA, see, e.g., [10 and 
references therein). 

Although solutions of NMF typically display some level of sparsity, some applications require even 
sparser solutions, leading to variants of NMF called Sparse Nonnegative Matrix Factorization. They 
are in general developed in two different ways: some authors define a priori a desired sparsity level 
and adapt the main iteration of their method in order to guarantee that the factors satisfy that level 
of sparsity throughout the application of the algorithm, see, e.g., [T2] . Alternatively, a penalty 
term can be added to the objective function to prevent the algorithm from considering dense solutions, 
see [13]. In particular, it is well-known that Zi-norm penalty terms induce sparser solutions (see, e.g., 
|144 13 dS]). More details about these techniques are given at the beginning of Section [4j 

Unfortunately the advantages of NMF (part-based representation and sparsity) over PCA come 
at a certain price. First, because of the additional nonnegativity constraints, the approximation error 
of the input data for a given factorization rank r will always be higher for NMF than in the uncon- 



strained case. Second, optimization problem (NMF) is more difficult to solve than its unconstrained 



counterpart: while PCA problems can be solved in polynomial time (e.g., using a Singular Value 
Decomposition technique [IS]), NMF problems belong to the class of NP-hard problems, as recently 
showed by Vavasis [IT]. However, it should also be pointed out that these drawbacks (higher error, 
NP-hardness) are also present for competing techniques emphasizing sparsity, such as SPCA. 

Because of its NP-hardness, practical algorithms cannot be expected to find provably optimal global 



solutions for (NMF) in a reasonable amount of time and aim instead at finding locally optimal solu- 
tions. Most methods start from some initial guess factors (V, W) and improve them iteratively using 
nonlinear optimization schemes such as projected gradient methods [18] , Newton-like methods |19[ [20] , 
(block-) coordinate descent (also called alternating nonnegative least squares - NNLS) |21 [ I22 | [T3]. mul- 
tiplicative updates [23], etc. (see also [TJ |2] [M] [25] and references therein). 



\\A\\f = j Afj) 2 denotes the Frobenius norm of matrix A. 
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In this paper, we introduce a novel approach to solve NMF problem based on the use of an under- 
approximation technique and show its effectiveness to obtain sparse solutions. Section [2] introduces 
our underapproximation problem, motivated by a recursive technique to solve NMF, studies the spar- 
sity of its solutions and proves that it is NP-hard for any fixed factorization rank. Nevertheless, 
Section [3] describes an algorithm to solve it approximately using a technique based on Lagrangian 
relaxation. Finally, in the last section, we test this approach on several standard image datasets, and 
show both qualitatively and quantitatively that it provides part-based and sparse representations that 
are comparable and sometimes superior to those obtained with standard Sparse Nonnegative Matrix 
Factorization techniques. 



2 Nonnegative Matrix Underapproximation 

2.1 A recursive approach 



Finding a rank-one nonnegative matrix factorization, i.e., solving (NMF) with r = 1 is notably easier 
than for higher factorization ranks: while the general problem is NP-hard, computing a globally 
optimal rank-one approximation can be done in polynomial time. More specifically, the first rank- 
one factor of the Singular Value Decomposition (SVD) of the input matrix is an optimal solution: 
indeed, the Perron-Frobenius theorem implies that the dominant left and right singular vectors of a 
nonnegative matrix are nonnegative, while the Eckart- Young theorem states that the outer product of 
these dominant singular vectors is the best possible rank-one approximation in the Frobenius norm. 

In principle, we might try exploit this result to find factorizations of higher ranks by applying it 
recursively: after identification of an optimal rank-one NMF solution (v,w), one could subtract the 
vw factor from M and apply the same technique to M — vw to recover the next rank-one factor. 
Unfortunately, this idea cannot work: the difference between M and its rank-one approximation may 
contain negative values (typically roughly half of them), so that the next SVD factor will no longer 
provide a nonnegative solution. Moreover, there is no hope of replacing SVD by another efficient 
technique for this step since [26J shows that it is NP-hard to find the optimal nonnegative rank-one 
approximation to a matrix which is not nonnegative. 

If we wish to keep the principle of a recursive algorithm finding one rank-one factor at a time, we 
have to add a constraint ensuring that the vw factor, when subtracted from M, gives a nonnegative 
remainder, i.e., we need to have vw < M. Therefore we introduce a similar upper bound constraint 
VW < M to the general (NMF) problem and obtain a new problem we call Nonnegative Matrix 
Underapproximation (NMU): given M £ M™ xn and 1 < r < min(m, n), the NMU optimization problem 
is defined as 

min MM - VW\ 1 1 such that V > 0, W > and VW < M. (NMU) 

VeR mxr ,WeR rxn 

Assuming we are able to solve it for r = 1, an underapproximation of any rank can then be built by 
following the recursive procedure outlined above. More precisely, if (Va,Wi : ) is a rank-one under- 
approximation for M, i.e., V :1 Wi- M(= i?i) and V-iWi-. < M, we have that R 2 = M - V : iWi : is 
nonnegative. R2 can then be underapproximated V2W2; < R2, leading to R3 = R2 — V2W2., and so 
on. After r steps, we get an underapproximation of rank r 

M > VaWl: + V 2 W 2 : + ■ ■ ■ + V :r W r . 

= [Vi V, 2 ... V r ][W 1:] W 2 -:, ... ;W r: ] 
= VW. 



Besides enabling this recursive procedure, we notice that NMU leads to a more localized part- 
based decomposition, in the sense that different basis elements tend to describe disjoint parts of the 
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input data (i.e., involving different nonzero entries). This is a consequence of the underapproximation 
constraints which impose the extracted parts (the basis elements V : k) to really be common features of 
the columns of M since 



Basis elements can only be combined to approximate a column of M if each of them represents a 
part of this column, i.e., none of the parts selected with a positive weight can involve a nonzero entry 
corresponding to a zero entry in the input column M-j. The following example demonstrates this 
behavior. 

Example 1 (Swimmer Database). The swimmer image dataset consists of 256 binary images of a 
body with 4 limbs which can be each in 4 different positions. NMF is expected to find a part-based 
decomposition of these images, i.e., isolate different constitutive parts of the images (the body and 
the limbs) in each of its basis elements. 

Figure [2] displays a sample of such images along with the basis elements obtained with NMF and 
NMU. While NMF elements are rather sparse, they are mixtures of several limbs. By contrast, NMU 
returns a even sparser solution and is able to extract a single body part for each of its elements. 



(a) 



(b) 



I 



(c) 



Figure 2: Basis elements generated for the swimmer image dataset with r = 8: (a) Sample images 
from the dataset, (b) NMF and (c) NMU; see Section [4] for the algorithms used to compute the 
factorizations. 



2.2 Sparsity 

The fact that NMU decompositions naturally generate sparser solutions than NMF can be explained 
as follows: since the zero entries of M can only be underapproximated by zeros, we have 

M v j = => (VW)ij = =► V ik = or W kj = 0, Vfc 

which shows that when the input matrix is sparse, many components of the NMU factors will have to 
be equal to zero. This observation can be made more formal: defining the sparsity s(M) of a m by n 
matrix M as the proportion of its zero entries, i.e., 

S {M) = € [0,1], 

mn 

we have the following theorem relating sparsity of M and its NMU factors. 
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Theorem 1. For any nonnegative rank-one underapproximation (v,w) £ x R5. of M £ M™ xn w;e 

s(u) + s(w) > s(M). 

Proof. For a rank-one matrix ftu, the number of nonzeros is exactly equal to the product of the 
number of nonzeros in vectors v and w. Therefore we have that (1 — s(vw)) = (1 — s(t>))(l — s(w)) 
which implies s(vw) = s(v) + s(u>) — s(v)s(w) < s(v) + s(w). Since underapproximation vw satisfies 
< vw < M, it must have more zeros than M and we have 

s(M) < s(sw) < s(v) + s(w), 

proving our claim. □ 

Recall the recursive definition of the residuals Rk+i = Rk — V-kWk- and R\ = M. The following 
corollary relates their sparsity and the sparsity of the whole rank-r approximation with that of the 
NMU factors. 

Corollary 1. For any nonnegative underapproximation (V, W) £ M™ xr xK™ of M £ M^ ixn we have 
for each factor 

s(V k ) + s(W k: ) > s(R k ) > s(M), l<k<r, 

and s(V) + s(W) > s{M). 

Proof. We have < V±Wk- < Rk < M, which implies by the previous theorem the first set of 
inequalities. Observing that s(V) = £ s (Kfc) and s(W) = \ ^2k s ^Wk) is sufficient to prove the 
second inequality. □ 

Sparsity of the residuals Rk is monotonically nondecreasing at each step, since M = R± > R2 > 
• • • > 0. Moreover, the following theorem can guarantee an increase in sparsity at each step. 

Theorem 2. For any locally optimal nonnegative rank-one underapproximation (v,w) £ MIp x M. 7 } of 
M £ M™ xn ; define sets I and J (supports of vectors v and w) by 

I={i\ Vi > 0}, J = {j I wj > 0}, 

and define matrix R(I, J) to be the submatrix of residual R = M — vw whose row and column indices 
belong respectively to I and J (corresponding to the submatrix of M that is not approximated by zeros). 
Then there is at least one zero in each row and each column of submatrix R(I, J). 

Proof. Simply observe that if R(i,J) > (resp. R(I,j) > 0) for some i £ I (resp. j £ J), V{ 
(resp. Wj) can be increased to obtain a strictly better solution, which contradicts the local optimality 
assumption. □ 

This ability of NMU to generate sparse part-based decomposition will be experimentally confirmed 
in Section HI 



2.3 Related work 

The problem of rank-one underapproximation has been first introduced by Levin in [27] in the case of 
positive stochastic matrices. He introduced a specific objective function different from the Frobenius 
norm and used a logarithmic change of variables in order to design an iterative method based on the 
corresponding optimality conditions. 

In |28| . the rank-one underapproximation problem is cast as a convex problem (hence efficiently 
solvable) using again different objective functions. Solutions are then used to initialize standard 
NMF algorithms in order to accelerate their convergence and, in general, find better final solutions 
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as compared to those obtained with random initializations. Similar behavior was observed for other 
judicious initializations in [29 1 130 1 13T]. 

More recently, Dong et al. [32 j studied the same problem with the additional constraint that 
the rank of the residual must be strictly smaller than the rank of the factorized matrix. Using the 
Wedderburn rank reduction formula, they proposed a numerical procedure which is able to compute 
the maximum rank splitting of a nonnegative matrix. However, the underlying optimization problem 
is NP-hard [T7] and their algorithm is not guaranteed to find a solution in all cases. 

Biggs et al. [33 also introduced a recursive procedure to solve NMF problems: their idea is to 
locate and then approximate nearly rank-one submatrices of M. However, the problem of locating 
maximum rank-one submatrices is also shown to be NP-hard, and their algorithm is not globally 
optimal. 



2.4 Complexity 



We now prove that (NMU) is NP-hard, even in the rank-one case (unlike (NMF), which is polynomi- 
ally solvable in the rank-one case). In order to do this, the rank-one version of the problem is proved 
to be equivalent to the bi clique problem, which is NP-hard. The result is then generalized to (NMU) 
with arbitrary factorization rank r using a simple construction. 



A bipartite graph G b is a graph whose vertices can be divided into two disjoint sets such that there 
is no edge between two vertices in the same set. A biclique K b is a complete bipartite graph, i.e., 
a bipartite graph where all the vertices from different sets are connected by an edge. Finally, the 
so-called maximum edge biclique problem (the biclique problem for short) in a bipartite graph 

G b = (V = Ui U V 2 , E C (V! x V 2 , 

is the problem of finding a biclique K b = (V , E') in G b (i.e., V = V{UV 2 ' C V and E' = (V{ x V 2 ') C E) 
with a maximum number of edges \E'\ = \ V{\ ■ |V^|. 

Letting M G {0, l} mXTl be the adjacency matrix of G b with V\ = {s\, . . . s m } and V% = {ii, . . . t n }, 

i.e., 

Mij = 1 (si,tj) £ E, 

and introducing indicator binary variables Vi (resp. Wj) to denote whether Si (resp. tj) belongs to the 



biclique K b , the Maximum edge Biclique Problem (MBP) in a bipartite graph can be formulated as 
follows 



^2 



mm > (iW,;,- — Vj,Wj 
id 

ViWj^Mij, Vi,j, (MBP) 
ve{0,l} m , w£{0,l} n . 

One can check that this objective is equivalent to max„ jt0 £V ■ ViWj. In fact, Mij — ViWj = (Mij — ViWj) 2 
since M, v and w are binary and > ViWj. 

The corresponding decision problem 11 Given K, does G b contain a biclique with at least K edges?" 



has been shown to be NP-complete [34j. Therefore, the corresponding optimization problem (MBP) 
is at least NP-hard. 



For r = 1 , ( NMU ) can be written as 

min y (Mij — ViWj 



h3 

ViWj < M^, Vi,j , (NMU1) 
v > 0, w > 0, 
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which is very close to (MBP): the difference is that vectors v and w are required to be binary for 
(MBP) and nonnegative for (NMU1). The next lemma proves that the two problems are actually 
equivalent. 



Lemma 1. For M G {0, l} mxn ; every optimal solution (v,w) of (NMU1) is such that vw is binary, 
i.e., vw G {0, l} mxri , and can then be trivially transformed into a binary optimal solution (v',w') G 
{0,l} m x {0,l} n of ([MBP). 



Proof. For M = 0, this is trivial. Otherwise, suppose (v,w) is an optimal solution of (NMU1). Let 
define (V , w') as 

' 1 if ^° and w' = l 1 if ^° 

otherwise •? \ otherwise ' 

and analyze the different possibilities: as ViWj < Mj,-, we have either 



o M, 



1 and < ViWj < 1 



1: 



o M, 



1 and ViWj 



v\w'j 



IUJJ 



o Mjj = =^ ViWj = =^ ^iOj- = 0. 



Therefore, fito,- < < Mij which implies 

\\M - v'w'\\ F < \\M — VW\ F- 

By optimality of (v, w), we must have vw = v'w' £ {0, l} mxn . Therefore, (v',w') = {v/ max(v) , w / m&x(w)) 
is an optimal binary s 
we must have max(f ) 



is an optimal binary solution of (NMU1) which is then also an optimal solution of (MBP) (note that 

max(w)" 1 ). 



□ 



Corollary 2. flNMUl[ ) is NP-hard. 

We now generalize Corollary [2] to the more general case of (NMU) with r > 1. 



Theorem 3. (|NMU|) is NP-hard. 

1 be the adjacency matrix of a bipartite graph G^. We define the matrix A 



Proof. Let M G {0, 1} 

as 



A = diag(M, r) 



/ M 






M 



\ 




\ ... M J 

which is the adjacency matrix of another bipartite graph G r h which is the graph G\, repeated r times. Let 
(V, W) be an optimal solution of ((NMU}. Sin ce VW = ££ =1 V k W k: , we have VW <A^ V, k W k: < A. 
Therefore (Vk, W k ) is a feasible solution of (NMU1) for the matrix A, i.e., for the graph G r b . Hence, 
each (Vk, W k ) corresponds to a biclique = {Vf U V*, E h ) of G r b with 

V ik / ^ Si G Vi and W kj + ^ t j G V 2 k . 

By optimality of (V, W) and since there are at least r independent maximum biclique in G£, each 
(V :k , W k ) must coincide with a maximum biclique of G r b which corresponds to a maximum biclique 
of Gi,. This is due to the fact that, because G r b is the graph Gb repeated r times, a biclique clearly 
cannot span two disjoint subgraphs of GT. Therefore, (NMU) is NP-hard since any instance of (MBP) 



can be polynomially reduced to an instance of (NMU). 



□ 
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3 An algorithm for NMU based on Lagrangian relaxation 



Since (NMU), like (NMF), is a NP-hard problem, we can not expect to solve it up to guaranteed 
global optimality in a reasonable (e.g., polynomial) computational time (unless P = NP). In this 
section, we propose a nonlinear optimization scheme based on Lagrangian relaxation in order to com- 
pute approximate solutions of (NMU). 



Drop the mxn underapproximation constraints VW < M of (NMU) and add them into the objective 
function with the corresponding Lagrange multipliers (dual variables, forming a matrix) A 6 M™ xri , 
to obtain the Lagrangian function L(V, W, A) 

^ m n 

L(V, W,A) = -\\M-VW\\ 2 F + J2Y, A v( VW ~ 

i=l j=l 

where a factor of \ was introduced to make the presentation nicer. The Lagrangian relaxation subprob- 
lem LRa consists in minimizing L for a fixed value of the A multipliers, leading to the corresponding 
Lagrangian dual function /(A) 

f(A)=min Q L(V,W,A) (LRa) 

where /(A) is well-defined because the minimum of L(V, W,A) is always attained, due to the fact that 
/ is bounded below and the search space can be restricted to a compact set. Indeed, considering each 
rank-one factor individually {V±,Wk-) and imposing w.l.o.g. ||Vjc||p = ||Wfc : ||^ = ||VJfc||F||Wfc : ||F = 
II VJfcWfc:||F) we have 



l|V*llF = l|W-fc:llF < \\VW\\ F 

< \\M- A\\ F + \\M -A-VW\\ f , 

< 2||Af — A||jf- Vfc 

where we have used the trivial solution (V, W) = (0, 0) to bound | \M — A — VW \ \p (cf. derivations of 
Section 3.1 ). 

Standard application of Lagrangian duality tells us that 
(NMU) = in in sin* IT. A) > 



min sup L(V, W, A) 

v,w>o A>0 



sup 

A>0 



min LCV.W.A) 

v,w>o 



sup /(A), 

A>0 



where the problem on the left of the inequality is equivalent to our original NMU formulation and 
the problem on the right is its Lagrangian dual, whose solution will provide a (hopefully tight) lower 
bound on the optimal (NMU). This new problem is a nondifferentiable optimization problem with the 
nice property that its objective /(A) = miny ; ^>o L(V, W, A) is concave and its maximization (over a 
convex set) is then a convex problem (see [35] and references therein). 

We describe in the next section a general solution technique, which consists in repeatedly applying 
the following two steps: 



1. Given multipliers A, compute (V, W) to (approximately) minimize L(V, W, A), i.e., solve (LRa 
this is discussed in Section 13.11 



2. Given solution (V, W), update multipliers A; this is described in Section 3.2 
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3.1 Solving the Lagrangian relaxation problem 

The following derivations 

L(V, W, A) = J2 \{M - VW)% + *ij{VW - M) 



2 E M l - E w« + 2 £(™o « 
- E A -- riri v E A v- u o 



»i3 M 

= 1||(M-A)-FW|||-^||A|| 2 F , 

show that minimizing L( V, W, A) for a fixed A is equivalent to minimizing 1 1 (M — A) — VW\ \ F . Matrix 
N = M — A is not necessarily nonnegative, therefore finding V > and W > such that ~ VW 
is a more general problem than NMF. It is actually studied in detail in |26j (see also [36]) where it is 
called Nonnegative Factorization (NF), is formulated as 

min \\N- VW\\% such that V > and W > 0, (NF) 

v&R mxr ,weR rxn 

with N € ]R mxn and 1 < r < min(m, n) and is shown to be NP-hard for any factorization rank 
(including r = 1). 

Some standard algorithms for NMF can easily adapted to handle an input matrix that is not 
nonnegative, i.e., solve a NF problem. For this work, we decided to use a recent technique called 
Hierarchical Alternating Least Squares (HALS), proposed in [22], which alternatively updates each 
column of V and each row of W with the following optimal closed- form solutions: 

V* k = argmin^o ||(M - A) - VW\\ 2 F 

/„ A .k-YA=i^k V --iBik\ 
= maxfO, — -2= ), (2) 

with A = (M — A)W T and B = WW T ', and 

W fc * = argmin^^o ||(M-A)-FW||| 

= max(0, Cfc: " EL ^^ : ), (3) 



Dkk 

with C = F T (M - A) and D = V T V. This can be viewed as a simple method of (block-)coordinate 
descent (also called alternating variables), which has been shown to perform strikingly well in practice, 
and much better than the popular multiplicative updates of Lee and Seung (see (H [151 12S])- Under 
some mild assumptions, every limit point of the above alternating scheme is a stationary point [H 126] . 

The main computational cost of one HALS iteration is the evaluation of A and C: they each 
require 2mnr (floating point) operations. One can check that the resulting total number of operations 
is Amnr + 0((m + n)r 2 ). 

Remark 1. HALS is sensitive to the scaling of the initial matrices. For example, if the initial matrices 
V and W are chosen such that VW S§> M, optimal columns of V and optimal rows of W computed by 
formulas ^ and ^ at the first step will most likely be equal to zero. This will lead to rank deficient 
approximations (V : kWk: = for some k) and numerical problems (for V± = 0, update of Wk : is not 
well defined and vice versa). If the initial matrices (V, W) are scaled [26], i.e., by ensuring that 

1 = a r^n a \\M - aVW\\ F = ^ W V ^ } (4) 
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where (A, B) = ■ AijBij = tiace(AB T ), this behavior is in general avoided. All initial matrices 
used in the following have been scaled. 

3.2 Update of the multipliers A 

The second step of our algorithm consists in updating A in order to find better (i.e., higher) solutions 
to the Lagrangian dual problem. Using the knowledge that any optimal solution (A*, V* , W*) of the 
Lagrangian dual must satisfy the following complementarity slackness conditions 

A* (M - V*W*)ij = Vi, j, as well as feasibility conditions A?- > and (M - V*W*) lj > 0, 

we see that the update rule for the multipliers A should satisfy the following: 

o if (M — VW)ij > 0, Aij should be decreased and eventually reach zero if (M - V*W*)ij > 0, 

o if (M — VW)ij < 0, should be increased to give more importance to (M — VW)ij in the cost 
function, hopefully in order to get a feasible solution such that (M — V*W*)ij > 0. 

In the sequel, we use the following rule to update A, which satisfies the above requirements: 

A-^max(0,A- /i k (M -VW)), fi k -> 0, 

where [i k is a predefined sequence of step lengths decreasing to zero; A can be initialized to zero. This 
update is inspired from the concept of subgradient methods [37]; in fact, one can easily check that the 
quantity (VW — M) is a subgradient of 

/(A) = min L(V, W, A) 
v ; v,w>o 

with respect to A, i.e., if (V,W) = argmin^ w>0 L(V, W, A), we have 

/(A) < /(A) + (VW - M,A — A) , VA. 

Two questions now arise 

o Since an iterative algorithm is used to solve (approximately) the Lagrangian relaxation problem 



(cf. section 3.1 ), after how many of these HALS iterations do we stop and proceed to update the 



multipliers A? 
o How do we choose the sequence of step lengths /^? 



Subgradient methods usually assume that the Lagrangian relaxation problem (LRa) can be solved 
exactly and can guarantee their convergence to an optimal solution provided an appropriate sequence 
of step sizes is selected (see, e.g., [35 J, for example satisfying the conditions 

oo oo 

^ fJ>t — > such that ^^A*! < +°° while ^^A*fc = +oo. 

fc=o k=0 

In the sequel, we choose to use fj,f. = \, which is such a suitable sequence. However, in our case, 



we cannot expect to solve (LRa) in a reasonable amount of time since the problem is NP-hard. It 
would even probably be too expensive to wait for the stabilization of (V, W) (e.g., getting close to a 
stationary but not necessarily optimal point). We therefore suggest to update (V, W) only a constant 
number of times T between each update of A, which leads to Algorithm L-NMU. Note that because we 



do not solve ( LRa ) exactly, Algorithm L-NMU is not guaranteed to converge to an optimal solution 
of the Lagrangian dual but, as we will see, it produces satisfactory solutions in practice. 

The additional computational cost of one iteration of algorithm L-NMU when compared with one 
iteration of HALS for NMF consists in the computation of M — A (needed in step 3) and the update of 
A (at step 4), which require 2mnr+0(mn) operations (and, in the special case r = 1, 5mn operations). 
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Algorithm 1 Lagrangian NMU (L-NMU) 

Require: M G M^ xn , r > 0, V G M+ Xr , G M^ xr \ maxiter, T. 
Ensure: (V, s.t. UW < M. 

1: A = 0; 

2: for k = 1 : maxiter do 

3: Update (V, W) using T iterations of HALS 
4: Update A <- max(0, A — | (M — VW)); 
5: end for 



Remark 2. Because convergence is not theoretically guaranteed, Algorithm L-NMU may end up with 
solutions that do not completely satisfy the under approximation constraint. Although our numerical 
experiments show that this has no detrimental influence on the quality of the obtained sparse part- 
based representations (see Section Q, we give here a simple technique to transform such a solution 
into a feasible solution. Indeed, it is enough to consider the following QP problem (convex quadratic 
objective function, linear inequality constraints) which only involves the V factor 

V* = argmin v > oyw < M \\M - VW\\ 2 F . (5) 

Because of its convexity, this problem can be solved up to global optimality in a very efficient manner, 
and replacing the original V factor by the optimal solution V* leads to a feasible solution (V*, W) to 



(NMU). 



Remark 3. Because update rule ([5]) is exact and computable in practice, it would be natural to 
consider a simpler algorithm based on its alternative application to the V and W factors, without 



using the Lagrangian relaxation technique, hoping to converge to a solution of (NMU). Unfortunately, 
we observed that this is quite inefficient in practice. In fact, 

o it is relatively computationally expensive to solve these linearly constrained quadratic programs 
(with ran + mr and mn + nr inequalities), at least compared to the HALS closed- form update 
rules 

o since the underapproximation constraint is imposed at each step, this algorithm has much less 
freedom to converge to good solutions: iterates rapidly get stuck on the boundary of the feasible 
domain, typically with (too) many zeros and a lower rank. For example, assuming M has one 
zero in each column, we have that for any positive matrix V the corresponding optimal W is 
equal to 0: 

Vj, 3i s.t. Mtj = => Vj, 3i s.t. v ikW kj = => W kj = 0, Vfc, j. 

k 

Therefore, such an algorithm can only work if we decide a priori which values in V and W 
should be equal to zero, i.e. if we find a good sparsity pattern for the solution, which is precisely 
where the difficulty of the problem lies. Note that the same behavior is observed if a HALS-type 
algorithm is used instead of ([5]) (i.e., updating columns of V and rows of W alternatively): after 
the update of one column of V, the residual will have one zero in each row (cf. Theorem [2| which 
will prevent the other columns of V to be nonzero (except if the sparsity pattern is chosen a 
priori). 

Remark 4. The L-NMU algorithm described above is not particularly well-suited to deal with very 
sparse input matrices. In fact, one has to store a potentially dense m x n matrix with the Lagrangian 
variables A. Nevertheless, Berry et al. [38] have obtained encouraging results when applying NMU to 
sparse anomaly detection problems in text mining. Moreover, it is possible to take advantage of the 
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input sparsity pattern and design a computationally cheaper method. First note that the Lagrangian 
variables associated with a zero of M will be nondecreasing in the course of the algorithm, since 

0<(FWw«), i and(M) ii = A§> < Ag +1) , 

where superscript denotes the solution at step k. Therefore one can significantly reduce the 
computational cost by defining 

= - g {k) for all i,j s.t. A% = 0, 

where g(k) is an arbitrary positive nondecreasing function, e.g., g(k) = p k with p > 1, as is implicitly 
done in [26 . 



4 Numerical tests on image datasets 

We have argued in Section |2]that NMU is potentially able to extract a better part-based representation 
of the data and that its factors should be sparser than those of the standard NMF, at the detriment of 
the approximation error. In this section, we support these claims by reporting results of computational 
experiments involving two variants of Algorithm L-NMU on several image datasets. 

A direct comparison between NMU and NMF is not very informative in itself: while the former 
will provide a sparser part-based representation, the latter will feature a lower approximation error. 
This does not really tell us whether the improvements in the part-based representation and sparsity 
are worth the increase in approximation error. For that reason, we chose to compare NMU with two 
other sparse nonnegative matrix factorizations techniques, described below, in order to better assess 
whether the increase in sparsity achieved by NMU is worth the loss in reconstruction accuracy. 



4.1 Sparse NMF 

We selected and tested the following two sparse nonnegative matrix factorization techniques that are 
frequently used in the literature. 

1. Hoyer describes in [11] an algorithm relying on additional explicit sparsity constraints on the 
factors, enforced at each iteration by means of a projection. The approximation error is reduced 
via a combination of projected gradient and multiplicative updates. For our experiments, we 
use the MATLAB® code provided by the authoi]^] 

It should be pointed out that Hoyer is using a different definition of sparsity: for any nonzero n 
dimensional vector x, his measure of sparsity sh{x) is defined as 



sh{x) = — — G 0,1. (6 

\/n — 1 



Hence, a vector with a single nonzero entry is perfectly sparse 

sh([0...0 fe0...0]) = 1, Vfc^O, 

while a vector with all entries equal to each other is completely dense 

sh([Jfe...Jfe]) = 0, Vfc^O. 

In our experiments, we report sparsity using both the standard s(-) indicator and Hoyer's sh(- 
measure. 

3 This code was downloaded from 



http : //www . cs .helsinki . fi/u/phoyer /software .html 
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Instead of enforcing sparsity at every iteration, a sparsity-inducing penalty term can be intro- 
duced in the objective function [13]. In particular, it is well-known that adding /i-norm penalty 
terms induce sparser solutions (see, e.g., |14ll9|H~5]). and we therefore solve the following problem: 



jmn^M -VW\\ Z F + w\\V\\ l + m \\W\\ x , (sNMF) 
where ||^4||i = \Aik\ and fiy and [iw are two positive parameters controlling the sparsity of 



V and W. In order to solve (sNMF ), we use the HALS algorithm which can easily be adapted to 
handle the additional Zi-norm penalty terms (see, e.g., jHHS]). This algorithm will be referred 
to as sNMF. 



Technical details for the first technique are more complicated, but it allows the sparsity of the fac- 
tors to be chosen a priori. The second technique is conceptually simpler but requires the determination 
of appropriate penalizing parameters by other means. 



4.2 Tested algorithms 

Algorithm L-NMU proposed in Section [3] can be used to compute underapproximations for any given 
factorization rank r. This opens the possibility of building a rank-r underapproximation in several 
different ways: one simple option consists in applying algorithm L-NMU directly to the rank-r prob- 
lem - we call this method global NMU (G-NMU). Another option consists in applying the recursive 
technique outlined in the introduction, used to motivate the introduction of underapproximations. 
More specifically, this means running algorithm L-NMU successively r times to compute r rank-one 
approximations, subtracting each approximation from the input matrix before computing the next 
one - we call this method recursive NMU (R-NMU). Note that many other variants are possible (e.g., 
computing two rank-| approximations, computing | successive rank-two approximations, etc.) but 
we only tested the two above-mentioned variants, which represent two extreme cases (no recursion 
and maximum recursion). 

In both cases, our implementation of algorithm L-NMU computes two HALS steps between each 
update of the multipliers A (i.e., we fixed T = 2). Most of the computational work done in one 
iteration of L-NMU consists in computing M — A, performing the two HALS steps and updating A; 
more specifically, one can estimate the computational cost of one iteration of G-NMU to lOmnr + 
0((m + n)r 2 ) operations, while an R-NMU iteration takes 13mn + 0((m + n)r) operations (repeated 
r times in the recursive procedure). 

For each dataset, we test five nonnegative factorization algorithms: NMF based on HALS updates 
(NMF), global NMU (G-NMU), recursive NMU (R-NMU), sparse NMF with /i-penalty terms (sNMF) 
and the algorithm of Hoyer. We also report the results of a standard Principal Component Analysis 
(PC A) to serve as a reference (recall that the approximation error of this unconstrained low-rank 
approximation, computed here with a singular value decomposition, is globally minimal for the given 
rank, but that its factors are neither nonnegative, nor sparse). 



4.3 Iteration limits and CPU time 

Each of the five iterative algorithms described above requires a limit on the number of its iterations; 
these limits were chosen in order to roughly allocate the same CPU time to each algorithm. More 
specifically, the standard NMF was given a 600-iterations limit, which corresponds to the computation 
of 600 HALS updates. The sparse sNMF, based on a slightly modified HALS update, was also allowed 
600 iterations. Because a HALS update involves Amnr + 0((m + n)r 2 ) operations, we can deduce the 
following iteration budgets for G-NMU and R-NMU from the leading terms in their corresponding 
operation counts: G-NMU is allowed 600 x ^ = 240 L-NMU iterations while R-NMU can take 
600 x A « 180 iterations. 
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An exception to the equal CPU time rule was made for the algorithm of Hoyer. Results obtained 
after an amount of CPU time similar to that of the other algorithms were too poor to be compared 
in a meaningful way. Indeed, because this method is based on a projected gradient method and 
multiplicative updates (both 0{mnr) operations per iteration), which are known to converge at a 
typically much slower rate, a relatively high limit of 1000 iterations had to be fixed, although the 
resulting CPU time is then much larger than for the other methods (for example, on the CBCL 
dataset, 600 iterations of HALS took ~ 80s. while 1000 iterations of the algorithm of Hoyer needed 
~ 260s.). 



4.4 Testing methodology 

Recall we decided to test algorithms sNMF and Hoyer to assess the quality of the sparsity-accuracy 
compromise proposed by our NMU approaches. To achieve this, we decided to pit each NMU variant 
against a solution of sNMF /Hoyer featuring the same level sparsity, and compare the resulting approx- 
imation errors. We therefore report results for 8 algorithms on each dataset: PCA, NMF, G-NMU, 
sNMF with the same sparsity as G-NMU, which we denote by sNMF{ G-NMU}, Hoyer{G-NMU} , 
R-NMU, sNMF{R-NMU} and Hoyer{R-NMU}. 

In order to enforce a sparsity similar to the NMU solution in Hoyer's code, we compute the sh 
measure of the NMU factors and input it as a parameter of the method (see description in subsec- 
tion 4.1 ); note however that we could only enforce this for the sparsest of the two NMU factors^] In 
the case of sNMF, sparsity cannot be directly controlled, and penalty parameters are found using the 
following adaptive procedure, which proved to work well in practice: fiy and fiw are initialized to 0.1 
and, after each iteration, fiy (resp. nw) is increased by 5 percent if s(V) (resp. s(W)) is below the 
target sparsity, and is decreased by 5 percent otherwise. 

All algorithms were run 10 times with the same initial random matrices and only the best solution 
with respect to the Frobenius norm of the error is reported. When testing with gray-level images, the 
input matrices M where normalized to have their entries varying between and 1, with representing 
white and 1 representing black (when trying to decompose M as a sum of parts, this make more sense 
than the opposite convention, since the dark regions are the constitutive parts of the objects in the 
image datasets we analyze). Finally, before computing reported sparsity measures of the factors, any 
sufficiently smalF] entry is rounded to zero (indeed, because algorithms are stopped by the iteration 
limit before convergence, true zeros are typically not all reached). All tests were run within the 
MATLAB® 7.1 (R14) version, on a 3GHz Intel® Core™2 Dual CPU PC. 



4.5 CBCL Face Database 

The CBCL face image dataset was used for the illustrative example of Figure [T] and is made of 2429 
gray-level images of faces represented with 19 x 19 pixels. We look for an approximation of rank 
r = 49. 

Figure [3] displays the basis elements for NMF, G-NMU, R-NMU and sNMF{G-NMU} (which was 
the best solution obtained in term of sparsity vs. error among all four sNMF and Hoyer variants). 
Both G-NMU, R-NMU and sNMF achieve a better part-based representation than NMF, generating 
sparser solutions. An interesting feature of R-NMU is that it extracts parts successively in order of 
"importance": the first basis element is a 'mean' face (which is dense) while the next ones describe 
different complementary parts (which become sparser as the recursion moves on, cf. Corollary [T] and 
Theorem [2). 

4 Ideally, we would have imposed sparsity for both factors, but the implementation we used seemed to return poor 
results in that situation. 

5 We declare an entry of a factor to be sufficiently small if it is less than 0.1% of the largest entry in its column. 
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Figure 3: Basis elements (V±) generated for the CBCL image dataset: (a) NMF, (b) G-NMU, (c) R- 
NMU and (d) sNMF with sparsity of G-NMU. 



A more quantitative assessment is provided in Table [TJ reporting for the 8 algorithms tested the 
relative error (in percent) of their solutions 



relative error 



|M- VW||j 
\\M\\f 



in the second column ("Plain") and the corresponding sparsity measures (in percent) of factors V and 
W in the last four columns. 





Plain 


Improved 


s(V) 


s(W) 


sh{V) 


sh(W) 


PCA 


7.43 


7.43 








22 


22 


NMF 


8.12 


8.11 


56 


11 


66 


22 


G-NMU 


12.45 


8.76 


74 


14 


74 


21 


sNMF{ G-NMU} 


8.68 


8.44 


74 


14 


74 


30 


Hoyer{ G-NMU} 


9.33 


8.78 


69 


6 


73 


16 


R-NMU 


16.42 


10.89 


53 


52 


63 


64 


sNMF{R-NMU} 


10.23 


9.49 


50 


50 


56 


57 


Hoyer{R-NMU} 


8.83 


8.56 


54 


12 


64 


22 



Table 1: Comparison of the relative approximation error and sparsity for the CBCL image dataset. 



As expected, PCA returns the smallest error, albeit with very dense factors. NMF already features 
much sparser factors (slightly half of the entries in V are equal to zero), at the cost of a relatively 



16 



modest increase in the approximation error (7.43 — * 8.12). G-NMU provides an even sparser solution 
(three quarters of zero entries), increasing again the approximation error (8.12 — * 12.45). The factors 
recursively computed by R-NMU are in comparison not as sparse: as explained above, this is because 
R-NMU focuses on obtained representative parts, including relatively dense ones for the first few steps 
of the recursion. However, it features a much sparse weight vectors, giving again more credit to the 
hypothesis that better parts are extracted. The corresponding approximation error is higher than for 
other methods, because the intrinsically greedy approach taken by R-NMU is not as efficient as a 
method that optimizes all the factors simultaneously. 

Is the increased sparsity provided by G-NMU worth the increase in approximation error ? Looking 
at the corresponding results for sNMU{G-NMU} and Hoyer{G-NMU}, i.e., for sparse NMF and 
Hoyer's algorithms with a similar target sparsity, it might seem at first that the answer is negative: 
the other methods return solutions with similar number of nonzeros (slightly higher for Hoyer) and a 
lower approximation error (8.68 and 9.33 instead of 12.45). Actually, this was expected: because it 
tries to return an underapproximation, i.e., factors such that VW < M, the entries in the error term 
M — VM are mostly nonnegative, while the other techniques, with no underapproximation constraint, 
obtain a smaller norm of the error by choosing the entries of M — VW to be roughly half negative, 
half positive. It is therefore not completely fair to compare directly the error of the NMU approach 
to the other techniques. 

In order to compensate for this, a simple rescaling could be used, i.e., multiplying VW by a scalar 
since VW < M (cf. Equation Q). However, we chose a different procedure that has the advantage of 
benefiting all algorithms, including those whose error was not suffering from the underapproximation 
constraint. Once a solution is computed by one of the eight algorithms, we fix the zero entries of V 
and W and optimize the approximation error, i.e., rainy - w>o ll-^ - ^WIIfj on ^ ne remaining (nonzero) 
entries (again, HALS can easily be adapted to handle this situation). In essence, this allows us to 
compare the sparsity patterns of the different solutions. We perform 100 additional HALS steps 
on each solution, and report the new relative approximation error in the third column of Table [T] 
("Improved"). Note that sNMF and Hoyer's errors are also improved by this procedure; this can 
be explained by the fact that they were also not directly trying to minimize the approximation error 
(Hoyer had to take into account its sparsity constraint, and sNMF was influenced by the penalty terms 
added to the approximation error). 

Looking now at the NMU solutions in a fairer comparison, we observe that their approximation 
error becomes very close to that of sNMF and Hoyer, in particular for G-NMU, and not very far from 
the denser NMF, so that we can conclude that the sparsity- approximation error compromise it offers 
is worthwhile. 

4.6 Swimmer Database 

For the swimmer image dataset described in Example [T] (256 images with 20 x 11 pixels), the 8 
basis elements obtained with the different algorithms are displayed on Figure [4] and the corresponding 
approximation errors and sparsity measures are reported in Table [2] 

As mentioned earlier, our two NMU algorithms are the only methods able to extract truly inde- 
pendent parts, while NMF and sNMF generate a combination of them. Note however that the solution 
generated by sNMF bears some similarity to the one of G-NMU. 

4.7 Hubble Space Telescope Spectral Images 

The next image dataset consists of 100 spectral images (128 x 128 pixels) of the Hubble telescope 
at different frequencies j39l ED], see Figure [5] With the choice r = 8, NMF generates a nearly exact 
factorization (relative error 0.29%), because the spectral reflectance of the Hubble telescope results 
from the additive linear combination of the reflectance of eight constitutive materials. Figure [6] and 
Table [3] provide the visual and computational results for this dataset. 



17 



1 




X 


1 

T 


T 




X 


-J 

1 


r 


r 




i 


1 


r 




1 



(b) 



I 




1 


1 




X 




X 


(c) 


.-■ 




$ 


■I 


1 

■ 






l_ 



(d) 



Figure 4: Basis elements for the swimmer image dataset: (a) NMF, (b) G-NMU, (c) R-NMU and 
(d) sNMF with sparsity of R-NMU. 



Error 


Plain 


Improved 


s(V) 


s(W) 


sh(V) 


sh(W) 


PCA 


37.98 


37.98 


77* 





67 


17 


NMF 


40.41 


40.41 


84 


45 


73 


67 


G-NMU 


47.70 


46.85 


94 


75 


85 


78 


sNMF{G-NMU} 


50.52 


42.04 


89 


66 


84 


73 


Hoyer{ G-NMU} 


42.04 


41.91 


90 


45 


80 


63 


R-NMU 


50.92 


50.71 


98 


66 


93 


65 


sNMF{R-NMU} 


41.66 


41.17 


85 


66 


80 


79 


Hoyer{R-NMU}** 


/ 


/ 


/ 


/ 


/ 


/ 



Table 2: Comparison of the relative approximation error and sparsity for the swimmer image dataset. 
*This value is very close to the percentage of zero rows in the matrix M (corresponding to pixels that 
are equal to zero in all images): in general, PCA factors feature a zero component when all the entries 
of either one row or one column of the input matrix are equal to zero. **When imposing the sparsity 
level of R-NMU (sh(V) = 0.93), Hoyer's algorithm was not able to converge, probably because it is 
not well adapted to handle high sparsity constraints. Note that sNMF is also sensitive to high sparsity 
requirements: high penalty terms sometimes lead to optimal zero factors {V-k = for some k), which 
had to be reinitialized. 



Because NMF is already a nearly exact reconstruction (Table [3]) , the NMU constraints are some- 
how redundant: NMF and G-NMU are basically equivalent and return solutions with very similar 
sparsity measures (albeit with a slightly lower error for G-NMU). For that reason, sNMF{G-NMU} 
and Hoyer{G-NMU} return results nearly identical to NMF and are omitted from the table. 

Recursive R-NMU extracts parts in order of importance: first, a global picture of the telescope and 
then its different constitutive parts. This allows it to generate the sparsest solution, with several basis 
elements representing well-delimited constitutive parts of the telescope not identified by the other 
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Error 


Plain 


Improved 


s(V) 


s(W) 


sh(V) 


sh(W) 


PCA 


0.01 


0.01 


57 





62 


25 


NMF 


0.29 


0.29 


64 


5 


57 


35 


G-NMU 


0.52 


0.11 


64 


4 


60 


31 


R-NMU 


3.74 


1.37 


79 


30 


71 


62 


sNMF{R-NMU} 


0.48 


0.37 


73 


28 


66 


64 


Hoyer{ R-NMU} 


0.77 


0.68 


75 





71 


12 



Table 3: Comparison of the relative approximation error and sparsity for the Hubble telescope image 
dataset. 



methods. 
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4.8 Kuls Illuminated Faces 



A static scene was illuminated from many directions with a moving light source to produce the Kuls 
image dataset]^] It consists of 20 images (64 x 64 pixels) of a face. Because the images are very similar, 
most of the information (more than 70 percent) can be expressed with only one factor. The remaining 
information resides in the different orientations of the lighting. Computational and visual results 
for a rank-5 factorization are given by Table [4] and Figure [7} We observe that NMF and G-NMU 
obtain similar results: even though they are both able to extract several faces with different lighting 
orientations, they do not extract a sparse and part-based representation. 

R-NMU first extracts a face illuminated from all directions, and then complementary parts repre- 
senting different orientations of the lighting (successively on the fourth row of Figure [7J global then 
light from the right, left, bottom and top). This nice recursive extraction of the information is a 
direct consequence of the underapproximation constraints. Although sNMF (with the same sparsity 
requirement as R-NMU) is also able to extract a part-based representation with a slightly better ap- 
proximation error, only two components are well-identified (left and right lighting mixed with top and 
bottom lighting). 



Error 


Plain 


Scaled 


Improved 




s(V) 


s(W) 


sh(V) 


sh(W) 


PCA 


4.36 


4.36 


4.36 










23 


15 


NMF 


4.38 


4.38 


4.38 




1 


7 


9 


38 


G-NMU 


6.27 


5.77 


4.49 




3 


20 


8 


48 


sNMF{G-NMU} 


4.42 


4.42 


4.41 




2 


20 


8 


47 


Hoyer{ G-NMU} 


4.60 


4.60 


4.71 




2 


25 


8 


53 


R-NMU 


8.13 


7.84 


5.73 




29 


31 


38 


67 


sNMF{R-NMU} 


5.24 


5.24 


5.01 




29 


31 


32 


59 


Hoyer{R-NMU} 


6.82 


6.82 


6.54 







71 


6 


92 



Table 4: Comparison of the relative approximation error and sparsity for the Kuls image dataset. 



5 Conclusion 

In order to solve the NMF problem in a recursive way, we have introduced a new problem, namely 
Nonnegative Matrix Underapproximation (NMU), which was shown to be NP-hard using its equiv- 
alence with the maximum-edge biclique problem. The additional constraints of NMU are shown to 
induce sparser factors and to lead naturally to a better part-based representation of the data, while 
keeping a fairly good reconstruction. We proposed an algorithm based on Lagrangian relaxation to 
find approximate solutions to NMU. 

We tested two factorization methods based on this algorithm, one with full recursion (R-NMU), the 
other without recursion (G-NMU), on several standard image datasets. After suitable post-processing, 
we observed that the factors computed by these methods indeed offer a good compromise between 
their achieved sparsity and the resulting approximation error, comparable or sometimes superior to 
that of two standard sparse matrix factorization techniques. 

These two variants can be contrasted in the following way: where G-NMU mainly focuses on 
finding sparse factors with small reconstruction error, in the same spirit as sNMF and Hoyer, R-NMU 
typically computes an even sparser factorization corresponding to a better part-based representation, 
albeit with a moderate increase in the reconstruction error (due to the greedy approach). Moreover, 
this second variant is useful in situations where the factorization rank is not fixed a priori: the fact 



Available at http : / /www . robots . ox . ac . uk/~amb/ 
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Figure 7: Basis for the Kuls image dataset, from top to bottom: sample of images, NMF, G-NMU, 
R-NMU, sNMF with sparsity of R-NMU. 



that it is recursive allows the user to stop the procedure as soon as the reconstruction error becomes 
satisfactory, without having to recompute a completely different solution from scratch every time a 
higher-rank factorization needs to be considered. 
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